InfoVis 2003 Contest - InfoZoom Entry

Michael Spenke, Christian Beilken
{Michael.Spenke,Christian.Beilken}@fit.fraunhofer.de
FIT - Fraunhofer Institute for Applied Information Technology 

See InfoVis 2003 Contest rules and task at http://www.cs.umd.edu/hcil/iv03contest/

Ratings used below: (Strength, Possible, Difficult, Not Available)

Pairwise comparisons of trees: Topological changes

Did anything change, in general, or in a subtree?

Rating:
Strength
Process:
A simple side-by-side comparison of two or more trees gives a first impression of the differences. 
Marking a cell in one tree also marks it in the other tree. 
This makes it easy to compare the size of corresponding cells. (First image)

In order to compare two or more trees precisely we first need a mapping between the nodes of the trees,
which defines when two nodes in different trees are regarded as identical.

In the animal classification trees each animal has a Latin name which is unique within a tree.
Therefore, we consider two animals in different trees as identical if and only if they have the same Latin name.
Using the derived attribute Count(Tree) per Latin Name we can exactly determine which animals are found in both trees
and which are only contained in one of the trees.
This is explained in detail further below in the answer to the application specific question 
"To what extent are the differences in the classifications due to differences in how animals are thought to be related?".

The attribute Latin Path contains the full path name of each animal. 
Therefore, the derived attribute Count(Latin Path) per Latin Name can be used to find all animals that are differently classified in the two trees.
This is also explained in detail further below in the answer to the question mentioned above.


In the file system logs there is no unique identification of a file besides the full path name.
Consequently, it is impossible to find files that were renamed or moved to another directory.
We can only see that some files are missing in a later snapshot and new files have appeared.
These files can be exactly determined using the derived attributes List(week) per file path and
Count(week) per file path. We simply zoom on all files which do not appear in all 5 snapshots.
(Second, third, and fourth image)
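
Outside InfoZoom, the two derived attributes can be sketched as a simple grouping over (week, file path) records. The Python sketch below is only an illustration: the sample records and the helper name weeks_per_path are assumptions and are not part of InfoZoom or the contest data.

```python
from collections import defaultdict

def weeks_per_path(records):
    """Map each file path to the sorted list of weeks in which it occurs."""
    weeks = defaultdict(set)
    for week, path in records:
        weeks[path].add(week)
    return {path: sorted(ws) for path, ws in weeks.items()}

# Hypothetical sample: (week, file path) pairs from the combined table.
records = [
    ("A", "/index.html"), ("B", "/index.html"), ("C", "/index.html"),
    ("D", "/index.html"), ("E", "/index.html"),
    ("A", "/old.html"), ("B", "/old.html"),
    ("D", "/new.html"), ("E", "/new.html"),
]

list_week = weeks_per_path(records)                       # List(week) per file path
count_week = {p: len(ws) for p, ws in list_week.items()}  # Count(week) per file path

# "Zoom" on the files that do not appear in all 5 snapshots.
changed = sorted(p for p, n in count_week.items() if n < 5)
print(changed)
```

In this sample, List(week) shows that /old.html disappeared after week B and that /new.html first appeared in week D.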

Another approach to find the differences between the snapshots is based on the creation and modification times of the files.
This is explained in detail further below in the answer to the application specific question 
"Were there a lot of pages created recently? If so, in which part of the file system?"
Image:
AnimalsSideBySide.JPG
The two animal trees side by side

CountWeekPerFile.JPG
Most files are found in all 5 snapshots.

ListWeekPerFile.JPG
Some files are found in only 4 or fewer snapshots.

DiffFiles.JPG
Overview of the files found in only 4 or fewer snapshots.
Answer:

What nodes were added, deleted?

Rating:
Strength
Process:
We can exactly determine the nodes that are contained in one tree but not in the other.
For details see previous question.
Image:
See previous question
Answer:
See previous question

Did any node or subtrees "move" in the tree? Can you characterize those movements?

Rating:
Difficult
Process:
Using the derived attribute Count(Latin Path) per Latin Name,
we can exactly determine the animals that are contained in both trees but with a different classification.
For details see first question.

It is not, however, possible to automatically decide if the differences found are the result of the move of a complete subtree.
Some manual browsing is necessary here.
Image:
See first question
Answer:
See first question

Pairwise comparisons of trees: Attribute value changes

Global impression: did things change a lot or not?

Rating:
Strength
Process:
We define the derived attributes Average(hitCount) per week and Average(size in KB) per week and sort by week.
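
As an illustration, the two averages can be approximated with a small grouping function. This is a sketch under assumptions: the row layout is hypothetical, and we assume that unknown values (represented here as None) are skipped, which may differ from what InfoZoom does.

```python
from collections import defaultdict

def average_per_week(rows, attribute):
    """Average(attribute) per week, skipping rows where the value is None."""
    totals = defaultdict(lambda: [0.0, 0])  # week -> [sum, count]
    for row in rows:
        value = row[attribute]
        if value is not None:
            totals[row["week"]][0] += value
            totals[row["week"]][1] += 1
    # Sorting by week puts the snapshots in alphabetical (chronological) order.
    return {week: s / n for week, (s, n) in sorted(totals.items())}

# Hypothetical sample rows.
rows = [
    {"week": "B", "hitCount": 4, "size in KB": 10},
    {"week": "A", "hitCount": 1, "size in KB": 10},
    {"week": "A", "hitCount": 3, "size in KB": None},
    {"week": "B", "hitCount": 6, "size in KB": 30},
]
avg_hits = average_per_week(rows, "hitCount")
avg_size = average_per_week(rows, "size in KB")
```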
Image:
AverageHitCount.JPG

Answer:

What nodes or subtrees changed the most?

Rating:
Strength
Process:
To answer this question, we defined a derived attribute Minimum(hitCount) per file path / Maximum(hitCount) per file path.
After a zoom on the highest values of this attribute, we see the files with the highest increase rates of hitCount within the five snapshots.
Image:
image013.png
Answer:
See image.


Did the value of attribute XYZ for this node increase or decrease? In absolute terms, or relatively to other siblings or other nodes.

Rating:
Strength
Process:
We zoomed on the file /index.html in all 5 weeks.
Then we sorted the table by week.
Image:
Increase.JPG

Answer:
The values for attribute hitCount show an increase from week B to E.


General visualization of trees: Topology

Overall characteristics: How large is the tree? How many levels deep? What is the deepest branch? Does the depth vary between subtrees or not?

Rating:
Possible
Process:
Image:
PathLength.JPG
Latin classification

CommonClassification.jpg
Common classification
Answer:

Path: What is the path of this node?

Rating:
Strength
Process:
The path of an animal is given by its attribute values and also by its attribute Latin Path.
Image:
Path.JPG
Answer:
Not applicable



Local relatives: What are the children, siblings, or cousins of this node?

Rating:
Strength
Process:
This can be directly seen.
Image:
Siblings.JPG
Answer:

Filtering by level: Show only the first level, or show only 3 levels down, or remove all the leaves

Rating:
Strength
Process:
Image:
HiddenAttributes.JPG
Levels below Class are hidden

Procection.JPG
Projection on the top 5 levels
Answer:


Topology questions that involve counting nodes can be seen as attribute-dependent questions: e.g. Which branch contains the largest number of nodes? or Which branch has the largest fan-out?

Rating:
Strength
Process:
Just look for the widest cells in each row. If you are unsure: Open the value menu and sort by frequency. Shown in the image for Order.
Image:
LargestFanout.jpg
Answer:
The subtrees with the largest fanouts are the Phylum Arthropoda, the Subphylum Hexapoda, the Class Insecta, the Subclass Pterygota, and the Superorder Neoptera.
Within the orders Diptera wins with 20158 entries, followed by Hymenoptera with 15117 entries.


General visualization of trees: Attribute based

Find nodes with high values of a numerical attribute X? (relative query)

Rating:
Strength
Process:
Switch to Overview Mode.
Select a range at the right end of the attribute's row and zoom in.
Repeat until the rightmost cell is large enough to display the value.
Alternatively, open the value list dialog and sort it backwards. The highest value is then displayed at the top.
Image:
HighNumericValues.jpg
Answer:
Not applicable


Find nodes with given value of a numerical attribute X? (absolute query)

Rating:
Strength
Process:
Switch to Overview Mode.
If the value can be already seen, just double-click it.
Otherwise select a rough range around the value and zoom in, possibly in several steps.
Alternatively, the value can be directly selected in the value list dialog, which might be very long, however.
Image:
NumericValues.jpg
Answer:
Not Applicable


Find nodes with value Y of categorical attribute X - What value of a categorical attribute occurs more often? e.g. Are there more farm animals or pets?

Rating:
Strength
Process:
In the Overview Mode the value distribution of all attributes is shown.
The width of each cell is proportional to the number of files with this value.
For cells that are too small we can look up the size in the value list dialog.
The value list can be sorted by frequency, so that the largest values/cells are on top.
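
The idea of a value list sorted by frequency, with cell widths proportional to the counts, can be sketched as follows. The sample values and the function name value_distribution are assumptions made for illustration; the text bars only mimic the cell widths.

```python
from collections import Counter

def value_distribution(values):
    """Return (value, count, share) triples, most frequent value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, n, n / total) for v, n in counts.most_common()]

# Hypothetical sample of the attribute extension.
extensions = ["html", "gif", "html", "jpg", "html", "gif", "pdf", "html"]
dist = value_distribution(extensions)

for value, n, share in dist:
    # The bar length plays the role of the cell width in Overview Mode.
    print(f"{value:5} {n:3} {'#' * round(share * 40)}")
```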
Image:
Overview.jpg
Answer:
html, gif, and jpg are the most frequent file formats.


Find nodes with certain values of two or more attributes (What video file is used the most?)

Rating:
Strength
Process:
Open the value list dialog of attribute extension.
Select the video formats like avi, mpg, mov and zoom in.
Zoom on the highest values of hitCount.
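
The two zoom steps correspond to a filter on extension followed by a maximum over hitCount. A minimal sketch, assuming hypothetical records (the paths and counts below are invented and do not come from the data set):

```python
VIDEO_EXTENSIONS = {"avi", "mpg", "mov"}

def most_used(files, extensions):
    """Return the file with the highest hitCount among the given extensions, or None."""
    candidates = [f for f in files if f["extension"] in extensions]
    return max(candidates, key=lambda f: f["hitCount"], default=None)

# Hypothetical sample records.
files = [
    {"file path": "/a/intro.mpg", "extension": "mpg", "hitCount": 12},
    {"file path": "/a/logo.gif", "extension": "gif", "hitCount": 90},
    {"file path": "/b/demo.avi", "extension": "avi", "hitCount": 7},
]
top_video = most_used(files, VIDEO_EXTENSIONS)
print(top_video["file path"])
```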
Image:
VideoFiles.jpg
Answer:
 /projects/hcil/kiddesign/icdl/icdl.mpg is used the most.



Number of nodes in a tree or subtree? (How many animals? How many mammals?)

Rating:
Strength
Process:
There are several possibilities
Image:
CountInsecta.jpg
Answer:
The subtree of Insecta contains 64423 animals.


Comparison of branches of the tree (Subtrees with most nodes; are there more mammals or fish?)

Rating:
Strength
Process:
Image:
FishAndMammals.jpg
Answer:
There are more bony fishes than mammals.


Largest fanout (What is the largest group of animals with the same lineage?)

Rating:
Strength
Process:
Just look for the widest cells in each row. If you are unsure: Open the value menu and sort by frequency. Shown in the image for Order.
Image:
LargestFanout.jpg
Answer:
The subtrees with the largest fanouts are the Phylum Arthropoda, the Subphylum Hexapoda, the Class Insecta, the Subclass Pterygota, and the Superorder Neoptera.
Within the orders Diptera wins with 20158 entries, followed by Hymenoptera with 15117 entries.


General visualization of trees: Known items

Which nodes have a particular string in their label? (Find "giraffe" in a tree of animals)

Rating:
Strength
Process:
We perform a full-text search in all attributes for "giraffe".
This zooms on all animals that contain "giraffe" in at least one of their attributes.
Image:
giraffe.jpg
Answer:
There are only two animals with giraffe in their common names.


Locate a node knowing its path

Rating:
Strength
Process:
Just click on the cell containing the next label. If the cell is too small, select a range containing the label or use the value menu.
Image:
Not Applicable
Answer:
Not Applicable


Go back to a node you have visited before

Rating:
Strength
Process:
There are several techniques:
Image:
Not Applicable
Answer:
Not Applicable


General visualization of trees: Labeling

Review all the labels in a subtree

Rating:
Strength
Process:
First we zoom on the subtree, e.g. Insecta.
For each rank we can get a popup-window with a list of all labels.
Image:
AllLabels.jpg
Answer:
The image shows a list of all species in the Insecta subtree.


General visualization of trees: Browsing

Explore the tree by performing a series of up and downs in the tree

Rating:
Strength
Process:
This is done by zoom-in and zoom-out operations. 
Video:
Click to see video
Answer:
Not Applicable


General visualization of trees: Managing the analysis

Marking nodes of interest

Rating:
Possible
Process:
Any subset of the displayed cells can be marked (selected) using the mouse (click, drag, ctrl-click, shift-click).
However, the marking is quite volatile: The next mouse click into the table will remove it.
Another way to mark a set of records is to create a new attribute interesting and to set its values to yes or no.
This rests on the fact that InfoZoom is also a very powerful editor:
We can simply select a cell or a range of cells and directly edit its values like in a spreadsheet.
The modification is performed in all records represented by the cell.
In this way, thousands of records can be modified in a single operation.

Once the attribute interesting is defined, we can later zoom on just the interesting values.
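
The effect of such a bulk edit can be sketched as one assignment applied to all records behind a selection. The record layout and the helper set_attribute are assumptions for illustration; InfoZoom performs the edit interactively.

```python
def set_attribute(records, selected, name, value):
    """Assign value to attribute name in every selected record; return the number changed."""
    changed = 0
    for record in records:
        if selected(record):
            record[name] = value
            changed += 1
    return changed

# Small illustrative sample.
records = [
    {"Latin Name": "Apis mellifera", "Class": "Insecta"},
    {"Latin Name": "Musca domestica", "Class": "Insecta"},
    {"Latin Name": "Canis lupus", "Class": "Mammalia"},
]

# Create the new attribute with a default, then edit the selected "cell" in one step.
set_attribute(records, lambda r: True, "interesting", "no")
n = set_attribute(records, lambda r: r["Class"] == "Insecta", "interesting", "yes")

# Later: zoom on just the interesting records.
interesting = [r["Latin Name"] for r in records if r["interesting"] == "yes"]
```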
Image:
Interesting.jpg
Answer:
Not Applicable


Removing special anomalies

Rating:
Strength
Process:
InfoZoom is also a very powerful editor.
We can simply select a cell and directly edit its value like in a spreadsheet.
The modification is performed in all records represented by the cell.
In this way, thousands of records can be modified in a single operation.
Image:
DiffClassification.jpg
Animals with different classifications in the two trees
Answer:
We experimentally cleared the selected cells in the above image.
Afterwards about 1000 animals did not have different classifications anymore.


Saving visualization settings for future reference

Rating:
Strength
Process:
The navigation history is stored as a sequence of commands.
A command sequence can be stored as a named query.
In order to perform a query later, InfoZoom executes the stored navigation commands.
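
Conceptually, a stored query is a list of navigation commands that is replayed on the data. The sketch below assumes a very simple command form, namely (attribute, allowed values) zoom steps; InfoZoom's actual command set and storage format are not described here, and all names are hypothetical.

```python
def run_query(records, commands):
    """Replay zoom commands in order; each keeps the records whose attribute value is allowed."""
    result = list(records)
    for attribute, values in commands:
        result = [r for r in result if r[attribute] in values]
    return result

# A named query is just a stored command sequence (hypothetical example).
queries = {
    "pdf files in projects": [
        ("toplevel directory", {"projects"}),
        ("extension", {"pdf"}),
    ],
}

# Hypothetical sample records.
records = [
    {"file path": "/projects/x/paper.pdf", "toplevel directory": "projects", "extension": "pdf"},
    {"file path": "/projects/x/index.html", "toplevel directory": "projects", "extension": "html"},
    {"file path": "/users/y/cv.pdf", "toplevel directory": "users", "extension": "pdf"},
]
hits = run_query(records, queries["pdf files in projects"])
```

Because the commands rather than the result are stored, the same query can be replayed after the data has changed.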
Image:
Queries.jpg
Answer:
Not Applicable


Keeping the history of your analysis, reviewing it and replaying it with different parameters

Rating:
Strength
Process:
The navigation history is stored as a sequence of commands. 
Using the back and forward buttons we can get an animated replay of our interaction.
The buttons also have an associated menu that shows the command history.
It can be used to jump directly to a saved state.

Image:
NavigationHistory.jpg
Answer:
Not Applicable


Phylogenies: Application specific tasks

This data set was not analyzed with InfoZoom.

Classifications: Application specific tasks

To what extent are the differences in the classifications due to differences in how animals are thought to be related? Are there other kinds of differences and can you explain them?

Rating:
Strength
Process:
There are two kinds of differences:
These differences can be exactly determined:

Latin Name uniquely identifies an animal within each tree.
We define a derived attribute Count(Tree) per Latin Name.
The resulting value is 2 for most of the animals.
This means that they are contained in both of the trees.
But some animals are found in only one tree. 
We can zoom on them by clicking on the 1.

The derived attribute Latin Path is the full path name of each animal.
It is similar to a fully qualified file name.
We define Count(Latin Path) per Latin Name and zoom on the animals where the result is 2.
These have a different classification in the two trees.
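
Both derived attributes amount to counting distinct values per Latin Name. The following sketch assumes that Count counts distinct values; the rows use placeholder names and paths that are invented for illustration.

```python
from collections import defaultdict

def distinct_count(rows, value, per):
    """Number of distinct values of attribute `value` for each value of attribute `per`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[per]].add(row[value])
    return {key: len(values) for key, values in seen.items()}

# Hypothetical rows of the combined table (placeholder names, not real taxa).
rows = [
    {"Tree": "A", "Latin Name": "Aus bus", "Latin Path": "Animalia/X/Aus bus"},
    {"Tree": "B", "Latin Name": "Aus bus", "Latin Path": "Animalia/X/Aus bus"},
    {"Tree": "A", "Latin Name": "Cus dus", "Latin Path": "Animalia/X/Cus dus"},
    {"Tree": "B", "Latin Name": "Cus dus", "Latin Path": "Animalia/Y/Cus dus"},
    {"Tree": "A", "Latin Name": "Eus fus", "Latin Path": "Animalia/Y/Eus fus"},
]

count_tree = distinct_count(rows, "Tree", "Latin Name")        # Count(Tree) per Latin Name
count_path = distinct_count(rows, "Latin Path", "Latin Name")  # Count(Latin Path) per Latin Name

only_in_one_tree = sorted(n for n, c in count_tree.items() if c == 1)
reclassified = sorted(n for n, c in count_path.items() if c == 2)
```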

We can also use color coding in order to highlight the areas of the overall trees where there are different classifications.
To achieve this, we specify that the attribute Count(Latin Path) per Latin Name defines the coloring.
Each cell is now colored according to the average value of Count(Latin Path) per Latin Name of the animals it represents.
Therefore, red areas have fewer differences than the average, green areas contain more differences.
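
The coloring rule can be sketched as a comparison of each cell's mean with the overall mean. This is a simplification under assumptions: we use two fixed colors plus a neutral case for exact ties, whereas the real display may use a graded color scale; the cell labels and values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def color_cells(cells):
    """Map each cell label to a color by comparing its mean with the overall mean."""
    overall = mean([v for values in cells.values() for v in values])
    colors = {}
    for label, values in cells.items():
        m = mean(values)
        colors[label] = "green" if m > overall else "red" if m < overall else "neutral"
    return colors

# Hypothetical cells: Count(Latin Path) per Latin Name of the animals in each cell.
cells = {"P": [1, 1, 1, 1], "Q": [2, 2, 1, 1]}
colors = color_cells(cells)
```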

We also defined several attributes like
in order to spot the differences more precisely.
Image:
DiffAnimals.jpg
Animals contained in only one of the trees

DiffClassification.jpg  
Animals with a different classification in the two trees

ColorCodingDiffs.jpg
Color Coding of the frequency of different path names
Answer:

Can you say in how many different subtrees a particular common name (such as "dolphin" or "horse") is used? How closely are these animals related? Are common names a good guide to understanding relationships?

Rating:
Strength
Process:
We perform a full-text search in all attributes for "dolphin".
This zooms on all animals that contain "dolphin" in at least one of their attributes.
Image:
FindDialog.jpg
Find-Dialog

Dolphins.jpg
Result of full-text search for "dolphin"

Horses.jpg
Result of full-text search for "horse"

Answer:

How many species or subspecies are named after biologists named "Townsend"?

Rating:
Strength
Process:
We perform a full-text search in Latin Name and Common Name for "townsend".
Image:
Townsend.jpg
Answer:
In Tree A there are 48 Latin Names and 15 Common Names which contain "townsend".

What kind of feedback does your tool provide to alert the user quickly when a wrong name is entered?

Rating:
Strength
Process:
We perform a full-text search in all attributes for "Spirurida" and then "Spirulida".
Image:
Spirurida.jpg
Result of full-text search for "Spirurida"

Spirulida.jpg
Result of full-text search for "Spirulida"
Answer:
The first image clearly shows that the expected result was not obtained.

For the top five subtrees with the most nodes-- are they likely to have a parent of a particular rank? Or does this happen in many ranks? Can you comment on how useful "rank" is?

Rating:
Strength
Process:
We do not completely understand the question.
We try to answer it anyway.

The size of subtrees is proportional to the width of the cells.
Looking at the complete tree A we see several large cells at different levels.
Image:
LargeSubtrees.jpg

5LargestSubtrees.jpg
Answer:

File system and usage logs: Application specific tasks

Introduction

As with the animal tree, we had to transform the XML files to an object/attribute table.
Each leaf of the tree, i.e. each file, constitutes a column of the table.
Each row of the table corresponds to a file attribute.

The inner nodes are the directories. They are also represented as attributes of each file:

Five complete snapshots of the file system have been taken at the end of weeks A to E.
They were all combined into one large table.
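
The flattening step can be sketched as deriving one record of attributes per file and snapshot. The sketch starts from plain path strings and omits the XML parsing; the attribute names follow those used in this entry, but the exact derivation rules are assumptions.

```python
import posixpath

def file_attributes(week, path):
    """Derive the directory attributes of one file in one snapshot."""
    directory, name = posixpath.split(path)
    parts = [p for p in directory.split("/") if p]
    return {
        "week": week,
        "file path": path,
        "directory path": directory,
        "toplevel directory": parts[0] if parts else "",
        "level 2": parts[1] if len(parts) > 1 else "",
        "extension": posixpath.splitext(name)[1].lstrip(".").lower(),
    }

# Combine all snapshots into one table (hypothetical sample paths).
snapshots = {"A": ["/index.html", "/projects/x/paper.PDF"], "B": ["/index.html"]}
table = [file_attributes(week, p) for week, paths in snapshots.items() for p in paths]
```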

In order to get a first overall impression of the whole data set, we start in InfoZoom’s Overview Mode.
Unlike in the Compressed Table Mode, the data set is not visualized as a table here.
Instead, each row independently shows the value distribution of an attribute.
The size of each cell is proportional to the number of files with that value.

image001.png

The following observations can be made:

Browsing is performed by interactively zooming into sub areas.

This is demonstrated in a video.

Click to see video

For example, we can double-click the cell containing the value A, to zoom on the first snapshot only.
The screen shot shows the result after a second zoom on
projects and a switch to the Compressed Table Mode:

image004.jpg

We can observe the following:

Where are the big directories?

There are several different interpretations of this question:
  A. Which toplevel directories contain the highest number of files?
  B. Which toplevel directories occupy the most disk space?
  C. Which directories directly contain the highest number of files?
  D. In which directories do the directly contained files occupy the most disk space?

Interpretation A: Which toplevel directories contain the highest number of files?

Alternative 1:

Rating:
Strength
Process:
Big directories are immediately visible since the width of each cell is proportional to the number of files it represents.
Image:
BigDirectories1.jpg
Answer:

Alternative 2:

Rating:
Strength
Process:
The value list for toplevel directory can be sorted by frequency. The frequency is identical to the number of files.
Image:
BigDirectories5.jpg
Answer:

Interpretation B: Which toplevel directories occupy the most disk space?

Alternative 1:

Rating:
Strength
Process:
Define a derived attribute Sum(size in KB) per toplevel directory and exclude the unknown sizes.
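
A minimal sketch of this derived attribute, assuming that unknown sizes are represented as None (the data set may encode them differently) and using hypothetical sample records:

```python
from collections import defaultdict

def sum_size_per_directory(files):
    """Sum(size in KB) per toplevel directory, largest first, ignoring unknown sizes."""
    totals = defaultdict(int)
    for f in files:
        if f["size in KB"] is not None:  # exclude the unknown sizes
            totals[f["toplevel directory"]] += f["size in KB"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample records.
files = [
    {"toplevel directory": "projects", "size in KB": 120},
    {"toplevel directory": "users", "size in KB": 300},
    {"toplevel directory": "projects", "size in KB": 250},
    {"toplevel directory": "users", "size in KB": None},
]
ranking = sum_size_per_directory(files)
```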
Image:
BigDirectories2.jpg
Answer:

Alternative 2:

Rating:
Strength
Process:
Declare size in KB as the attribute that determines the column width of the table. 
Normally, all columns have the same width, and therefore the width of each cell is proportional to the number of files it represents. 
In the image below, however, the width of each cell is proportional to the total size of the files it represents.
Image:
image005.png

Answer:

Interpretation C: Which directories directly contain the highest number of files?

Rating:
Strength
Process:
Define a derived attribute Count(file path) per directory path and zoom on its highest values.
Image:
BigDirectories6.jpg
Answer:

Interpretation D: In which directories do the directly contained files occupy the most disk space?

Rating:
Strength
Process:
Define a derived attribute Sum(size in KB) per directory path and zoom on its highest values.
Image:
BigDirectories3.jpg
Video:
Click to see video
Answer:

Can you see different patterns in the files? (Can you make out the difference between personal pages, class pages and research project pages?)

Rating:
Strength
Process:
Zoom on each of the four largest toplevel directories.
Image:

Usershollings.jpg
Toplevel directories

Users.jpg
Toplevel directory users

Projects.JPG
Toplevel directory projects

Class.jpg
Toplevel directory class

Answer:

Were there a lot of pages created recently? If so, in which part of the file system?

Rating:
Strength
Process:
We used the creation and modification times of the files to answer this question. 
First of all we defined the derived attribute mtime >= ctime. Somewhat surprisingly, the result was false in most cases. 
On the other hand ctime >= mtime is true in 99.9% of the cases. (There are 59 exceptions!?) 
So obviously the two attributes ctime and mtime had been swapped by mistake in the data set.
We corrected this error and defined a few attributes derived from ctime and mtime:
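
The consistency check and the correction can be sketched as follows. The timestamps are hypothetical integers chosen only to show the mechanics; the helper names are assumptions.

```python
def share_true(files, predicate):
    """Fraction of files for which the predicate holds."""
    return sum(1 for f in files if predicate(f)) / len(files)

def swap_times(f):
    """Return a copy of the record with ctime and mtime exchanged."""
    fixed = dict(f)
    fixed["ctime"], fixed["mtime"] = f["mtime"], f["ctime"]
    return fixed

# Hypothetical records as delivered (ctime and mtime possibly swapped).
files = [
    {"ctime": 20, "mtime": 10},
    {"ctime": 30, "mtime": 10},
    {"ctime": 15, "mtime": 15},
    {"ctime": 5, "mtime": 9},
]

before = share_true(files, lambda f: f["mtime"] >= f["ctime"])
corrected = [swap_times(f) for f in files]
after = share_true(corrected, lambda f: f["mtime"] >= f["ctime"])
```

A file should not be modified before it is created, so a low value of before together with a high value of after suggests that the two columns were exchanged.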
image010.jpg
It can be seen that the vast majority of the files were created and modified before the first snapshot. 
Most files were even modified on the same day in August 2002!

In order to answer the question we zoomed on all files with a creation day 2003/1/25 or later in snapshot E.
Image:
RecentlyCreated.jpg
Answer:

Are the newer directories bigger than the older projects?

We compared the number of files in the subdirectories of projects in weeks A and E.

Rating:
Strength
Process:
We defined a new attribute Count(file path) per level2 and week.
Next we zoomed on the weeks A and E and then on projects.
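
Grouping by two attributes at once can be sketched with a counter keyed by (level 2, week). The sample records are hypothetical; the sketch assumes that each file path occurs at most once per week, so counting rows equals counting file paths.

```python
from collections import Counter

def count_per_level2_and_week(files):
    """Count(file path) per level 2 directory and week."""
    return Counter((f["level 2"], f["week"]) for f in files)

# Hypothetical records below the toplevel directory projects.
files = [
    {"week": "A", "level 2": "x", "file path": "/projects/x/1.html"},
    {"week": "E", "level 2": "x", "file path": "/projects/x/1.html"},
    {"week": "E", "level 2": "x", "file path": "/projects/x/2.html"},
    {"week": "A", "level 2": "y", "file path": "/projects/y/1.html"},
    {"week": "E", "level 2": "y", "file path": "/projects/y/1.html"},
]
counts = count_per_level2_and_week(files)

# Growth between the first and the last snapshot for each subdirectory.
directories = sorted({f["level 2"] for f in files})
growth = {d: counts[(d, "E")] - counts[(d, "A")] for d in directories}
```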
Image:
CompareProjects.jpg
Answer:
The image shows the three biggest directories. None of them grew between A and E.

When was the page giving directions to the department last updated?

Rating:
Strength
Process:
A full-text search for "directions" shows a few files containing that string in their names.
Clicking on the toplevel directory department shows the result.

Image:
DirectionsLastUpdated.jpg
Answer:
The file /department/directions.shtml was last modified on 2002/8/31, 03:43:19

Which are the popular webpages?

Alternative 1:

Rating:
Strength
Process:
In order to find the most popular files we simply zoom on the highest values of the attribute hitCount.
Image:
Click to see video
Answer:
It turns out that /index.html and /index.shtml have the most hits, which is not very surprising.

Alternative 2:

Rating:
Strength
Process:
A more interesting question is to find the most popular pdf-files. 
To accomplish this, we simply double-click pdf before zooming on the highest values of hitCount.
Image:
Click to see video
Answer:

Alternative 3:

Rating:
Strength
Process:
We can also use an attribute-dependent column width here.
In the image below, the width of each cell is proportional to the hitCount.
So large cells represent popular web pages.
Moreover, we have defined a derived attribute Sum(hitCount) per toplevel directory and we have sorted the table by this new attribute.

Image:
image007.png
Answer:
We can observe that projects is the most popular top level directory and hcil is the most popular project, but mainly because of the banner-images in gif-format.

Are there some labs more popular than others?

Rating:
Strength
Process:
Zoom on projects and define Sum(hitCount) per level 2.
Image:
PopularLabs.jpg
Answer:
hcil, plus, and hpsl are the most popular labs.

Which areas are getting more popular? Less popular?

Rating:
Strength
Process:
We define the attribute hitCount as the color-giving attribute (indicated by the traffic lights left of its attribute name).
Then we zoom into weeks A and E.
This shows cells with more hits than the average in green and cells with fewer hits than the average value in red.
Comparing weeks A and E the change of color can indicate that the number of hits increased or decreased.
Image:
IncreasedHitCounts.jpg
Answer:
The toplevel directory class gets about three times more hits in week E than in week A (from 2.11 to 6.46).
Directory users decreased from 3.06 to 2.87 on average.

Are new pages more popular than old pages?

Rating:
Strength
Process:
We defined a new attribute creation year derived from creation day,
and another derived attribute Average(hitCount) per creation year.
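
A sketch of the two derived attributes, assuming that creation day is a string of the form year/month/day (as in 2003/1/25) and using hypothetical records:

```python
from collections import defaultdict

def creation_year(creation_day):
    """Extract the year from a creation day such as '2003/1/25'."""
    return int(creation_day.split("/")[0])

def average_hits_per_year(files):
    """Average(hitCount) per creation year, ordered by year."""
    hits = defaultdict(list)
    for f in files:
        hits[creation_year(f["creation day"])].append(f["hitCount"])
    return {year: sum(v) / len(v) for year, v in sorted(hits.items())}

# Hypothetical sample records.
files = [
    {"creation day": "1999/3/14", "hitCount": 2},
    {"creation day": "1999/11/2", "hitCount": 4},
    {"creation day": "2003/1/25", "hitCount": 9},
]
averages = average_hits_per_year(files)
```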
Image:
image015.png
Answer:
It turned out that the average hit count is generally lower for old files. The year 2000 is an exception. Another exception is the 3 files created in 1993, which have an average hit count of about 60, mainly because the file /users/samir/khuller.gif was hit 129 times.

Which old pages are popular?

Rating:
Strength
Process:
We zoomed on the files created in 1999 or before and on the highest hit counts.
This showed some banner images in GIF-format.
It is more interesting to look for popular papers. Therefore we focused on ps- and pdf-files.
Moreover, we defined the derived attribute Sum(hitCount) per file path to sum up the hit counts of the 5 weeks for each file.
Image:
OldPopular.jpg
Answer:
See image.

What proportion of the pages are never used?

Rating:
Strength
Process:
The distribution of hit counts can be directly seen.
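
The proportion read off the display corresponds to a cumulative share of the hit count distribution. A minimal sketch with hypothetical hit counts (the numbers are invented and do not reproduce the answer below):

```python
def share_at_most(hit_counts, threshold):
    """Fraction of pages whose hit count is at most the threshold."""
    return sum(1 for h in hit_counts if h <= threshold) / len(hit_counts)

# Hypothetical hit counts of eight pages in one week.
hit_counts = [0, 0, 1, 3, 7, 0, 12, 2]

never_used = share_at_most(hit_counts, 0)
print(f"{never_used:.1%} of the pages are never used")
```

With a threshold of 5 instead of 0, the same function gives the share of seldom used pages.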
Image:
HitCounts.jpg
Answer:
About 50% of the pages are never used.

What proportion of the pages are seldom used?

Rating:
Strength
Process:
The distribution of hit counts can be directly seen.
Image:
HitCounts.jpg
Answer:
About 90% of the pages are used 5 times or less per week.